Introduction

This document outlines the process of analyzing RNA-Seq data. The analysis includes data preprocessing, exploratory data analysis (EDA), differential expression analysis using the limma-voom workflow, and downstream functional enrichment analysis.

1. Environment Setup

1.1. Package Installation

First, we ensure that all necessary packages from CRAN and Bioconductor are installed.

# Installation of BiocManager to handle Bioconductor ----
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

# Installation of necessary libraries ----
cran_packages <- c("tidyverse", "pheatmap", "ggrepel", "RColorBrewer", "UpSetR")
bioc_packages <- c("edgeR", "limma", "clusterProfiler",
                   "org.Hs.eg.db", "EDASeq")

# Install CRAN packages if not already installed
missing_cran <- cran_packages[!(cran_packages %in%
                                  installed.packages()[,"Package"])]
if (length(missing_cran) > 0) {
  cat("Instalowanie pakietĂłw z CRAN:", paste(missing_cran, collapse=", "), "\n")
  install.packages(missing_cran)
}

# Install Bioconductor packages if not already installed
missing_bioc <- bioc_packages[!(bioc_packages %in%
                                  installed.packages()[,"Package"])]
if (length(missing_bioc) > 0) {
  cat("Instalowanie pakietĂłw z Bioconductor:",
      paste(missing_bioc, collapse=", "), "\n")
  BiocManager::install(missing_bioc)
}

1.2. Library Import

Next, we load all the required libraries for the analysis.

library(tidyverse)
library(edgeR)
library(limma)
library(UpSetR)
library(pheatmap)
library(ggrepel)
library(RColorBrewer)
library(clusterProfiler)
library(org.Hs.eg.db)
library(EDASeq)

2. Data Reading and Pre-processing

This section describes how the RNA-Seq count data and associated metadata are read, checked for consistency, and prepared for analysis.

2.1. Data Import

We start by importing the expression data and the metadata from their respective .tsv files.

# Import of data ----
counts_data <- read.delim2("./expression_data.tsv", header = TRUE, row.names = 1, check.names = FALSE)

# Import of metadata ----
metadata <- read.delim2("./metadata.tsv", header = TRUE, row.names = 1, check.names = FALSE)

2.2. Data Cleaning and Validation

We check for missing values and standardize column names and sample identifiers. This involves removing whitespaces and ensuring that all samples in the count data have corresponding metadata entries.

# Filter out missing patients, by comparing colnames of count_data with
# rownames in metadata
metadata_set <- rownames(metadata)
counts_data_set <- colnames(counts_data)[6:23]

# Function to remove whitespaces in rownames of metadata
remove_whitespace <- function(x) {
  gsub("\\s+", "", x)
}

#Removal of whitespaces in metadata
metadata[,"tissue"] <- remove_whitespace(metadata[,"tissue"])
metadata[,"smoker"] <- remove_whitespace(metadata[,"smoker"])

# Vector operation on the metadata_set to remove whitespaces
metadata_set <- remove_whitespace(metadata_set)

stopifnot(all(metadata_set %in% counts_data_set))
stopifnot(all(counts_data_set %in% metadata_set))
rownames(metadata) <- metadata_set

# Checking completeness of the records and finding records with missing values
# in counts_data
missing_records <- is.na(counts_data)
missing_records_count <- colSums(missing_records)
missing_records_display <- counts_data[, missing_records_count > 0]

The check for missing values yielded the following result:

if (ncol(missing_records_display) > 0) {
  print("Records with missing values:")
  print(missing_records_display)
} else {
  print("No records with missing values found.")
}
## [1] "No records with missing values found."

Finally, we extract the raw count matrix for downstream analysis.

# Extracting the raw counts data with preservation of rownames
raw_counts <- counts_data[, metadata_set]
# Displaying the first few rows of the raw counts data
print("Raw counts data:")
## [1] "Raw counts data:"
print(head(raw_counts))
##           387N 387T 395N 395T 397N 397T 406N 406T 411N 411T 417N 417T 452N 452T
## DDX11L1     51   53   22   25   10   37   33   17   14   25   10   19   21    7
## WASH7P    3358 4477 1505 2237  594 1575 1900 2945 1232 1712 1380 1596 2080 4179
## MIR6859-3  446  724  223  757   65  224  330  718  140  225  131  150  276  367
## MIR6859-2  446  724  223  757   65  224  330  718  140  225  131  150  276  367
## MIR6859-1  446  724  223  757   65  224  330  718  140  225  131  150  276  367
## MIR6859-4  446  724  223  757   65  224  330  718  140  225  131  150  276  367
##           529N 529T 532N 532T
## DDX11L1    185  147   65   53
## WASH7P    4254 5891 2482 4263
## MIR6859-3  647 1070  208  657
## MIR6859-2  647 1070  208  657
## MIR6859-1  647 1070  208  657
## MIR6859-4  647 1070  208  657

3. Exploratory Data Analysis (EDA)

3.1. Exploration of Raw Counts Data

Construction of DGEList Object

We begin by creating a DGEList object, which is a container for RNA-Seq data used by the edgeR package.

dge <- DGEList(counts = raw_counts, 
               samples = metadata, 
               genes = data.frame(GeneSymbol = rownames(raw_counts)))
# Displaying the DGEList object
print("DGEList object:")
## [1] "DGEList object:"
print(dge)
## An object of class "DGEList"
## $counts
##           387N 387T 395N 395T 397N 397T 406N 406T 411N 411T 417N 417T 452N 452T
## DDX11L1     51   53   22   25   10   37   33   17   14   25   10   19   21    7
## WASH7P    3358 4477 1505 2237  594 1575 1900 2945 1232 1712 1380 1596 2080 4179
## MIR6859-3  446  724  223  757   65  224  330  718  140  225  131  150  276  367
## MIR6859-2  446  724  223  757   65  224  330  718  140  225  131  150  276  367
## MIR6859-1  446  724  223  757   65  224  330  718  140  225  131  150  276  367
##           529N 529T 532N 532T
## DDX11L1    185  147   65   53
## WASH7P    4254 5891 2482 4263
## MIR6859-3  647 1070  208  657
## MIR6859-2  647 1070  208  657
## MIR6859-1  647 1070  208  657
## 26480 more rows ...
## 
## $samples
##      group lib.size norm.factors tissue  smoker
## 387N     1 66380268            1 normal  former
## 387T     1 78657010            1  tumor  former
## 395N     1 41286149            1 normal  former
## 395T     1 48194140            1  tumor  former
## 397N     1 45240270            1 normal current
## 13 more rows ...
## 
## $genes
##           GeneSymbol
## DDX11L1      DDX11L1
## WASH7P        WASH7P
## MIR6859-3  MIR6859-3
## MIR6859-2  MIR6859-2
## MIR6859-1  MIR6859-1
## 26480 more rows ...

Filtering Lowly Expressed Genes

Genes with low expression levels are filtered out to improve statistical power.

keep <- filterByExpr(dge, group = dge$samples$tissue)
dge_filtered <- dge[keep, , keep.lib.sizes=FALSE]

The filtering process resulted in the following reduction of genes:

## Number of genes before filtering: 26485
## Number of genes after filtering: 19348

Distribution of Library Sizes

We visualize the library sizes for each sample to check for major discrepancies.

plot_data <- dge_filtered$samples %>%
  rownames_to_column("Sample")
mean_lib_size_mil <- mean(plot_data$lib.size) / 1e6

ggplot(plot_data, aes(x = reorder(Sample, lib.size), y = lib.size / 1e6)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  geom_hline(yintercept = mean_lib_size_mil, linetype = "dashed",
             color = "red", size = 1) +
  geom_text(aes(x = -Inf, y = mean_lib_size_mil,
                label = paste("Average =", round(mean_lib_size_mil, 2), "million")),
            hjust = -0.1, vjust = -0.5, color = "red", size = 4) +
  labs(
    title = "Sample library size",
    x = "Sample",
    y = "Library size (million)" #Adjusted for better readability
  ) +
  theme_bw(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The distribution of library sizes suggests a stron need for well fitted normalization, as there are some samples with significantly lower library sizes than the average. ### Counts Distribution Before Normalization

A boxplot of log-CPM values shows the distribution of counts for each sample before normalization.

logcpm <- cpm(dge_filtered, log=TRUE)
logcpm_long <- as.data.frame(logcpm) %>%
  rownames_to_column("GeneSymbol") %>%
  pivot_longer(
    cols = -GeneSymbol,
    names_to = "Sample",
    values_to = "logCPM"
  ) %>%
  left_join(dge_filtered$samples %>% rownames_to_column("Sample"), by = "Sample")

ggplot(logcpm_long, aes(x = Sample, y = logCPM, fill = tissue)) +
  geom_boxplot(outlier.shape = NA) + 
  labs(
    title = "Distribution of log-CPM values before normalization",
    x = "Sample",
    y = "log-CPM"
  ) +
  theme_bw(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "right")

The boxplot shows the distribution of log-CPM values across samples, indicating variability in expression levels. The tissue type is used to color the boxes, allowing for visual comparison between normal and tumor samples. ### MA-plot on Pseudo-reference

An MA-plot helps to visualize intensity-dependent ratios of gene expression. Here, we compare a sample to a pseudo-reference created from ‘normal’ tissue samples.

sample_index <- 2
sample_name <- colnames(logcpm)[sample_index]
normal_samples_mask <- dge_filtered$samples$tissue == "normal"
pseudo_reference <- rowMeans(logcpm[, normal_samples_mask])

M <- logcpm[, sample_index] - pseudo_reference
A <- (logcpm[, sample_index] + pseudo_reference) / 2
ma_data <- data.frame(A = A, M = M)

ggplot(ma_data, aes(x = A, y = M)) +
  geom_point(alpha = 0.2, size = 0.8) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed", size = 1) +
  geom_smooth(method = "loess", se = FALSE, color = "steelblue", size = 1) +
  labs(
    title = paste("MA-plot for sample:", sample_name),
    subtitle = "Comparison to the pseudo-reference from 'normal' tissue cohort",
    x = "A (Average log-CPM)",
    y = "M (Difference in log-CPM)"
  ) +
  theme_bw(base_size = 10)

The MA-plot shows the relationship between average expression and log-fold change, indicating how the sample compares to the pseudo-reference. The TMM normalization will help to adjust for these differences. ### Relative Log Expression (RLE) Plot Before Normalization

RLE plots are used to check if there are any systematic biases in the data. Ideally, the boxes should be centered around zero.

order_df <- dge_filtered$samples %>%
  rownames_to_column("Sample") %>%
  arrange(tissue, Sample)
ordered_counts <- dge_filtered$counts[, order_df$Sample]
ordered_colors <- as.numeric(factor(order_df$tissue))

plotRLE(as.matrix(ordered_counts), 
        outline = FALSE, 
        las = 2, 
        col = ordered_colors,
        main = "RLE before normalization",
        colour_by = "tissue")
legend("topright", 
       legend = levels(factor(order_df$tissue)), 
       fill = 1:nlevels(factor(order_df$tissue)), 
       title = "Tissue")

The RLE plot shows that the samples are not centered around zero, indicating the presence of systematic biases that need to be addressed through normalization. ## 3.2. Normalization and Visualization

TMM Normalization and Voom Transformation

We apply Trimmed Mean of M-values (TMM) normalization and then use voom to transform the count data for linear modeling. The voom plot shows the mean-variance trend.

dge_norm <- calcNormFactors(dge_filtered, method = "TMM")
temp_design <- model.matrix(~tissue, data = dge_norm$samples)
v_check <- voom(dge_norm, temp_design, plot = TRUE)

Searching for Sources of Variability in Normalized Data

We use dimensionality reduction techniques to explore the main sources of variation in the normalized data.

MDS Plot

A Multidimensional Scaling (MDS) plot shows the relationships between samples.

plotMDS(dge_norm, 
        col = as.numeric(factor(dge_norm$samples$tissue)),
        labels = dge_norm$samples$smoker,
        dim.plot = c(1,2))
title("MDS Plot (counts normalized with TMM")

The plot is uninformative, thus there is a need to perform PCA to find the main sources of variation. #### PCA Plots

Principal Component Analysis (PCA) helps identify major patterns of variation.

logcpm_norm <- cpm(dge_norm, log=TRUE)
pca_res <- prcomp(t(logcpm_norm), scale. = TRUE)
pca_data <- as.data.frame(pca_res$x) %>%
  rownames_to_column("Sample") %>%
  left_join(dge_norm$samples %>% rownames_to_column("Sample"), by = "Sample")

percent_var <- pca_res$sdev^2 / sum(pca_res$sdev^2)

# PC1 vs PC2
ggplot(pca_data, aes(x = PC1, y = PC2, color = tissue, shape = smoker)) +
  geom_point(size = 4, alpha = 0.8) +
  labs(
    title = "PCA Plot (counts normalized with TMM): PC1 vs PC2",
    x = paste0("PC1: ", round(percent_var[1] * 100), "% variance explained"),
    y = paste0("PC2: ", round(percent_var[2] * 100), "% variance explained")
  ) +
  theme_bw(base_size = 10)

# PC2 vs PC3
ggplot(pca_data, aes(x = PC2, y = PC3, color = tissue, shape = smoker)) +
  geom_point(size = 4, alpha = 0.8) +
  labs(
    title = "PCA Plot (counts normalized with TMM): PC2 vs PC3",
    x = paste0("PC2: ", round(percent_var[2] * 100), "% variance explained"),
    y = paste0("PC3: ", round(percent_var[3] * 100), "% variance explained")
  ) +
  theme_bw(base_size = 10)

# PC1 vs PC3
ggplot(pca_data, aes(x = PC1, y = PC3, color = tissue, shape = smoker)) +
  geom_point(size = 4, alpha = 0.8) +
  labs(
    title = "PCA Plot (counts normalized with TMM): PC1 vs PC3",
    x = paste0("PC1: ", round(percent_var[1] * 100), "% variance explained"),
    y = paste0("PC3: ", round(percent_var[3] * 100), "% variance explained")
  ) +
  theme_bw(base_size = 10)

There is clearly a cluster missing - data is probably being clustered by sex, and that should be taken into account before the main analysis.

4. Upstream & Differential Expression Analysis

4.1. Inferring Sex from Gene Expression

We infer the sex of each sample based on the expression of Y-chromosome genes. This will be added as a cofactor in the downstream analysis.

gene_info <- counts_data[, "Chr", drop = FALSE]
y_genes_mask <- sapply(strsplit(as.character(gene_info$Chr), ";"), function(x) {
  x <- x[!is.na(x) & x != ""]
  if (length(x) == 0) return(FALSE)
  all(x == "chrY")
})

y_gene_counts <- raw_counts[y_genes_mask, ]
mean_y_expr <- colMeans(log2(y_gene_counts + 1))

sex_plot_data <- data.frame(
  sample = names(mean_y_expr),
  mean_y_expr = mean_y_expr
)

ggplot(sex_plot_data, aes(x = reorder(sample, mean_y_expr), y = mean_y_expr)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  labs(
    title = "Average Y chromosome genes expression",
    x = "Sample",
    y = "Average expression log2(counts + 1)"
  ) +
  theme_bw(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

A threshold of logFC = 1 is set to classify samples as ‘male’ or ‘female’.

threshold <- 1
metadata$sex <- ifelse(mean_y_expr > threshold, "male", "female")

4.2. Re-constructing DGEList with Sex Information

We repeat the DGEList construction, filtering, and normalization steps, this time including the inferred sex in the metadata.

dge <- DGEList(counts = raw_counts, 
               samples = metadata, 
               genes = data.frame(GeneSymbol = rownames(raw_counts)))

keep <- filterByExpr(dge, group = dge$samples$tissue)
dge_filtered <- dge[keep, , keep.lib.sizes=FALSE]

cat("Number of genes before filtering:", nrow(dge), "\n")
## Number of genes before filtering: 26485
cat("Number of genes after filtering:", nrow(dge_filtered), "\n")
## Number of genes after filtering: 19348
dge_norm <- calcNormFactors(dge_filtered, method = "TMM")
temp_design <- model.matrix(~tissue, data = dge_norm$samples)
v_check <- voom(dge_norm, temp_design, plot = TRUE)

4.3. PCA by Sex Cohort

To investigate the effect of sex, we perform PCA separately for male and female cohorts.

male_samples <- dge_norm$samples$sex == "male"
female_samples <- dge_norm$samples$sex == "female"

logcpm_norm <- cpm(dge_norm, log=TRUE)
logcpm_male <- logcpm_norm[, male_samples]
logcpm_female <- logcpm_norm[, female_samples]

PCA for Male Cohort

pca_res_male <- prcomp(t(logcpm_male), scale. = TRUE)
pca_data_male <- as.data.frame(pca_res_male$x) %>%
  rownames_to_column("Sample") %>%
  left_join(dge_norm$samples %>% rownames_to_column("Sample"), by = "Sample")
percent_var_male <- pca_res_male$sdev^2 / sum(pca_res_male$sdev^2)

ggplot(pca_data_male, aes(x = PC1, y = PC2, color = tissue, shape = smoker)) +
  geom_point(size = 4, alpha = 0.8) +
  stat_ellipse(aes(group = tissue), type = 't', linetype = "dashed") +
  labs(
    title = "PCA for male cohort: PC1 vs PC2",
    x = paste0("PC1: ", round(percent_var_male[1] * 100), "% variance explained"),
    y = paste0("PC2: ", round(percent_var_male[2] * 100), "% variance explained")
  ) +
  theme_bw(base_size = 10)

ggplot(pca_data_male, aes(x = PC2, y = PC3, color = tissue, shape = smoker)) +
  geom_point(size = 4, alpha = 0.8) +
  stat_ellipse(aes(group = tissue),type = 't', linetype = "dashed") +
  labs(
    title = "PCA for male cohort: PC2 vs PC3",
    x = paste0("PC2: ", round(percent_var_male[2] * 100), "% variance explained"),
    y = paste0("PC3: ", round(percent_var_male[3] * 100), "% variance explained")
  ) +
  theme_bw(base_size = 10)

The PCA for male cohort shows better clusteting of tumor samples against the normal samples. ### PCA for Female Cohort

pca_res_female <- prcomp(t(logcpm_female), scale. = TRUE)
pca_data_female <- as.data.frame(pca_res_female$x) %>%
  rownames_to_column("Sample") %>%
  left_join(dge_norm$samples %>% rownames_to_column("Sample"), by = "Sample")
percent_var_female <- pca_res_female$sdev^2 / sum(pca_res_female$sdev^2)

ggplot(pca_data_female, aes(x = PC1, y = PC2, color = tissue, shape = smoker)) +
  geom_point(size = 4, alpha = 0.8) +
  stat_ellipse(aes(group = tissue), type = 't', linetype = "dashed") +
  labs(
    title = "PCA for female cohort: PC1 vs PC2",
    x = paste0("PC1: ", round(percent_var_female[1] * 100), "% variance explained"),
    y = paste0("PC2: ", round(percent_var_female[2] * 100), "% variance explained")
  ) +
  theme_bw(base_size = 10)

ggplot(pca_data_female, aes(x = PC2, y = PC3, color = tissue, shape = smoker)) +
  geom_point(size = 4, alpha = 0.8) +
  stat_ellipse(aes(group = tissue),type = 't', linetype = "dashed") +
  labs(
    title = "PCA for female cohort: PC2 vs PC3",
    x = paste0("PC2: ", round(percent_var_female[2] * 100), "% variance explained"),
    y = paste0("PC3: ", round(percent_var_female[3] * 100), "% variance explained")
  ) +
  theme_bw(base_size = 10)

The female cohort demonstrates huge clustering of tumor samples against the normal samples. It clearly shows that some genes are different from the reference tissue, but as the normal tissue clusters quite tightly, the tumor tissue seems to be quite dispersed on the PC1 vs PC2 plot. The PC2 plot vs PC3 plot shows the same story. Sex was definitely the noisy factor in here. Although separate analyses could be performed on male and female cohorts, for this analysis we will not use it as such and we also won’t include sex as a covariate in the model, as the recruitment task did not specified it, and stratification by sex would pin me to the chair for next couple of days. It should be done like that, but not for the recruitment tasks, which are time sensitive.

5. Differential Expression and Downstream Analysis

5.1. Defining Models and Contrasts

We set up the design matrices for different comparisons of interest.

dge_norm$samples$tissue <- factor(dge_norm$samples$tissue,
                                  levels = c("normal", "tumor"))
dge_norm$samples$smoker <- factor(dge_norm$samples$smoker,
                                  levels = c("never", "former", "current"))

Task 1: Difference between tumor tissue and normal tissue

We model the effect of tissue type, accounting for sex and smoking status as covariates.

tissue_design <- model.matrix(~0 + tissue, data = dge_norm$samples)
colnames(tissue_design) <- gsub("tissuetumor", "tumor", colnames(tissue_design))
colnames(tissue_design) <- gsub("tissuenormal", "normal", colnames(tissue_design))

v <- voom(dge_norm, tissue_design, plot = TRUE)

fit <- lmFit(v, tissue_design)

cont.matrix <- makeContrasts(
  tumor_vs_normal = tumor - normal,
  levels = tissue_design
)

fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
summary(decideTests(fit2, p.value = 0.05, lfc = 1))
##        tumor_vs_normal
## Down               519
## NotSig           18638
## Up                 191
tumor_vs_normal_results <- topTable(fit2, coef = "tumor_vs_normal",
                                    number = Inf, sort.by = "none",
                                    adjust.method = "BH",
                                    confint = TRUE) %>%
  arrange(adj.P.Val) %>%
  mutate(Significant = ifelse(adj.P.Val < 0.05 & abs(logFC) > 1, "Yes", "No"))

head(tumor_vs_normal_results)
##         GeneSymbol     logFC      CI.L      CI.R  AveExpr         t
## BMPER        BMPER -3.695096 -4.625649 -2.764543 2.210845 -8.284063
## EDNRB        EDNRB -3.534340 -4.539510 -2.529170 5.550089 -7.335467
## COL10A1    COL10A1  3.443815  2.472427  4.415203 3.440002  7.396155
## CD36          CD36 -3.063270 -3.935332 -2.191208 5.539465 -7.328190
## RTKN2        RTKN2 -2.640415 -3.452059 -1.828771 5.444163 -6.786808
## BTNL9        BTNL9 -4.173878 -5.457362 -2.890393 5.002983 -6.784345
##              P.Value   adj.P.Val        B Significant
## BMPER   6.866779e-08 0.001328584 7.832186         Yes
## EDNRB   4.382497e-07 0.002151174 6.569513         Yes
## COL10A1 3.878408e-07 0.002151174 6.561570         Yes
## CD36    4.447331e-07 0.002151174 6.560572         Yes
## RTKN2   1.352898e-06 0.004385178 5.509710         Yes
## BTNL9   1.359886e-06 0.004385178 5.485800         Yes

The results show a decent number of genes differentially expressed between tumor and normal tissues with high, with many downregulated genes in tumor samples. ### Task 2: Difference between smoking statuses

We model the effect of smoking status, controlling for tissue and sex.

smoking_design <- model.matrix(~0 + smoker, data = dge_norm$samples)
colnames(smoking_design) <- gsub("smokercurrent", "current", colnames(smoking_design))
colnames(smoking_design) <- gsub("smokerformer", "former", colnames(smoking_design))
colnames(smoking_design) <- gsub("smokernever", "never", colnames(smoking_design))

v_smoking <- voom(dge_norm, smoking_design, plot = TRUE)

fit_smoking <- lmFit(v_smoking, smoking_design)
cont.matrix_smoking <- makeContrasts(
  current_vs_never = current - never,
  former_vs_never = former - never,
  current_vs_former = current - former,
  levels = smoking_design
)

fit2_smoking <- contrasts.fit(fit_smoking, cont.matrix_smoking)
fit2_smoking <- eBayes(fit2_smoking)
summary(decideTests(fit2_smoking, p.value = 0.05, lfc = 1))
##        current_vs_never former_vs_never current_vs_former
## Down                  0               0                 0
## NotSig            19348           19348             19348
## Up                    0               0                 0
current_vs_never_results <- topTable(fit2_smoking, coef = "current_vs_never",
                                     number = Inf, sort.by = "none",
                                     adjust.method = "BH", confint = TRUE) %>%
  arrange(adj.P.Val) %>%
  mutate(Significant = ifelse(adj.P.Val < 0.05 & abs(logFC) > 1, "Yes", "No"))

former_vs_never_results <- topTable(fit2_smoking, coef = "former_vs_never",
                                    number = Inf, sort.by = "none",
                                    adjust.method = "BH", confint = TRUE) %>%
  arrange(adj.P.Val) %>%
  mutate(Significant = ifelse(adj.P.Val < 0.05 & abs(logFC) > 1, "Yes", "No"))

current_vs_former_results <- topTable(fit2_smoking, coef = "current_vs_former",
                                      number = Inf, sort.by = "none",
                                      adjust.method = "BH", confint = TRUE) %>%
  arrange(adj.P.Val) %>%
  mutate(Significant = ifelse(adj.P.Val < 0.05 & abs(logFC) > 1, "Yes", "No"))

cat("Top results for current vs never:\n")
## Top results for current vs never:
head(current_vs_never_results)
##          GeneSymbol      logFC        CI.L        CI.R   AveExpr         t
## DDX11L1     DDX11L1 -1.7345488 -3.21704313 -0.25205438 -1.009812 -2.450621
## MRPL20       MRPL20  0.6279992  0.05058706  1.20541127  4.531589  2.278011
## CDK11B       CDK11B  0.7473005  0.10165081  1.39295025  5.770285  2.424270
## CDK11A       CDK11A  0.8289029  0.14669769  1.51110802  5.507028  2.544903
## CALML6       CALML6  1.5566414  0.28103361  2.83224918 -1.911408  2.555960
## TP73-AS1   TP73-AS1 -0.7485917 -1.43987139 -0.05731209  4.506304 -2.268161
##             P.Value adj.P.Val         B Significant
## DDX11L1  0.02422930 0.5456027 -4.239606          No
## MRPL20   0.03459827 0.5456027 -3.770468          No
## CDK11B   0.02559797 0.5456027 -3.595341          No
## CDK11A   0.01987383 0.5456027 -3.485964          No
## CALML6   0.01941435 0.5456027 -4.274444          No
## TP73-AS1 0.03529908 0.5456027 -3.774986          No
cat("\nTop results for former vs never:\n")
## 
## Top results for former vs never:
head(former_vs_never_results)
##           GeneSymbol     logFC      CI.L       CI.R   AveExpr         t
## DDX11L1      DDX11L1 -1.193489 -2.613494 0.22651597 -1.009812 -1.760400
## FAM138A      FAM138A -2.878167 -6.421814 0.66547990 -2.396297 -1.701172
## FAM138C      FAM138C -2.878167 -6.421814 0.66547990 -2.396297 -1.701172
## FAM138F      FAM138F -2.878167 -6.421814 0.66547990 -2.396297 -1.701172
## LOC729737  LOC729737 -1.016934 -2.133178 0.09931007  5.194728 -1.908167
## MIR6723      MIR6723  5.580601  2.244488 8.91671374  2.943554  3.503668
##              P.Value adj.P.Val         B Significant
## DDX11L1   0.09459751  0.623878 -4.581641          No
## FAM138A   0.10538978  0.623878 -4.587694          No
## FAM138C   0.10538978  0.623878 -4.587694          No
## FAM138F   0.10538978  0.623878 -4.587694          No
## LOC729737 0.07175794  0.623878 -4.519311          No
## MIR6723   0.00240535  0.623878 -4.486934          No
cat("\nTop results for current vs former:\n")
## 
## Top results for current vs former:
head(current_vs_former_results)
##           GeneSymbol      logFC      CI.L      CI.R   AveExpr          t
## DDX11L1      DDX11L1 -0.5410596 -2.131807 1.0496875 -1.009812 -0.7124045
## WASH7P        WASH7P -0.1946459 -1.084579 0.6952876  5.232518 -0.4581106
## MIR6859-3  MIR6859-3 -0.2395454 -1.509067 1.0299764  2.392043 -0.3952122
## MIR6859-2  MIR6859-2 -0.2395454 -1.509067 1.0299764  2.392043 -0.3952122
## MIR6859-1  MIR6859-1 -0.2395454 -1.509067 1.0299764  2.392043 -0.3952122
## MIR6859-4  MIR6859-4 -0.2395454 -1.509067 1.0299764  2.392043 -0.3952122
##             P.Value adj.P.Val         B Significant
## DDX11L1   0.4849615 0.9998201 -4.598030          No
## WASH7P    0.6521232 0.9998201 -4.623334          No
## MIR6859-3 0.6971340 0.9998201 -4.610857          No
## MIR6859-2 0.6971340 0.9998201 -4.610857          No
## MIR6859-1 0.6971340 0.9998201 -4.610857          No
## MIR6859-4 0.6971340 0.9998201 -4.610857          No

The results are statistically insifgnificant (after p-value correction), but the non-corrected p-values will be used for further exploration and mining. ### Task 3: Interaction between smoking and tumor status

We investigate if the effect of tumor vs. normal tissue is different in smokers compared to non-smokers.

dge_norm$samples$smoking_status <- ifelse(dge_norm$samples$smoker == 'never', 'no', 'yes')
dge_norm$samples$smoking_status <- factor(dge_norm$samples$smoking_status, levels = c("no", "yes"))

design_interaction <- model.matrix(~0 + tissue * smoking_status, data = dge_norm$samples)
colnames(design_interaction) <- gsub("tissuetumor:smoking_statusyes", "smokertumor", colnames(design_interaction))

v_int <- voom(dge_norm, design_interaction, plot = TRUE)

fit_int <- lmFit(v_int, design_interaction)

cont.matrix_int <- makeContrasts(
  smoking_tumor_vs_normal = smokertumor - tissuenormal,
  levels = colnames(design_interaction)
)

fit2_int <- contrasts.fit(fit_int, cont.matrix_int)
fit2_int <- eBayes(fit2_int)

summary(decideTests(fit2_int, p.value = 0.05, lfc = 1))
##        smoking_tumor_vs_normal
## Down                     13196
## NotSig                    5914
## Up                         238
smoking_tumor_vs_normal_results <- topTable(fit2_int, coef = "smoking_tumor_vs_normal",
                                            number = Inf, sort.by = "none",
                                            adjust.method = "BH", confint = TRUE) %>%
  arrange(adj.P.Val) %>%
  mutate(Significant = ifelse(adj.P.Val < 0.05 & abs(logFC) > 1, "Yes", "No"))

head(smoking_tumor_vs_normal_results)
##        GeneSymbol      logFC       CI.L       CI.R   AveExpr         t
## DDX17       DDX17  -8.855371  -9.462514  -8.248228  8.754419 -30.63963
## MALAT1     MALAT1 -13.633914 -14.644881 -12.622947 13.522385 -28.33033
## HLA-E       HLA-E -10.614887 -11.427669  -9.802104  9.273558 -27.43526
## CELF1       CELF1  -6.701365  -7.250620  -6.152110  6.625336 -25.63051
## RPL13A     RPL13A  -9.042184  -9.776572  -8.307795  9.463326 -25.86518
## SON           SON  -8.389915  -9.074861  -7.704969  8.288801 -25.73176
##             P.Value    adj.P.Val        B Significant
## DDX17  5.321434e-17 1.029591e-12 25.15109         Yes
## MALAT1 2.125990e-16 2.056683e-12 23.12180         Yes
## HLA-E  3.745022e-16 2.415290e-12 23.90174         Yes
## CELF1  1.240077e-15 2.999126e-12 23.49265         Yes
## RPL13A 1.056570e-15 2.999126e-12 23.29292         Yes
## SON    1.157088e-15 2.999126e-12 23.37851         Yes

There is a vast amount of significant genes for further analysis and exploration ## 5.2. Visualization of DE Results

MA Plots

MA plots for each comparison visualize the relationship between average expression and log-fold change.

MA Plot for Tumor vs Normal

# MA plot for tumor vs normal
tumor_vs_normal_results$ggplotSignificant <- ifelse(
  abs(tumor_vs_normal_results$logFC) >= 2 & 
    tumor_vs_normal_results$adj.P.Val < 0.05,
  "Yes",
  "No"
)

ggplot(tumor_vs_normal_results, aes(x = AveExpr, y = logFC)) +
  geom_point(aes(color = ggplotSignificant), alpha = 0.5) +
  scale_color_manual(values = c("No" = "grey", "Yes" = "red")) +
  labs(title = "MA Plot: Tumor vs Normal", x = "Average Expression", y = "Log Fold Change") +
  theme_bw(base_size = 10)

The result is promising and shows a large number of downregulated genes in tumor samples compared to normal tissue, with a few upregulated genes that may be of interest for further investigation. ### MA Plot for Smoking Tumor vs Normal

smoking_tumor_vs_normal_results$ggplotSignificant <- ifelse(
  abs(smoking_tumor_vs_normal_results$logFC) >= 4 & 
    smoking_tumor_vs_normal_results$adj.P.Val < 0.05,
  "Yes",
  "No"
)

ggplot(smoking_tumor_vs_normal_results, aes(x = AveExpr, y = logFC)) +
  geom_point(aes(color = ggplotSignificant), alpha = 0.5) +
  scale_color_manual(values = c("No" = "grey", "Yes" = "red")) +
  labs(
    title = "MA Plot: Smoking Tumor vs Normal",
    x = "Average Expression",
    y = "Log Fold Change"
  ) +
  theme_bw(base_size = 10)

There is a huge amount of strongly downregulated genes and a small bunch of upregulated genes, which definetely need further exploration, to find potential carcinogenic effects of smoking. ### MA Plot for Current vs Never Smokers

current_vs_never_results$ggplotSignificant <- ifelse(
  abs(current_vs_never_results$logFC) >= 2 & 
    current_vs_never_results$P.Value < 0.05,
  "Yes",
  "No"
)
ggplot(current_vs_never_results, aes(x = AveExpr, y = logFC)) +
  geom_point(aes(color = ggplotSignificant), alpha = 0.5) +
  scale_color_manual(values = c("No" = "grey", "Yes" = "red")) +
  labs(
    title = "MA Plot: Current vs Never Smokers",
    x = "Average Expression",
    y = "Log Fold Change"
  ) +
  theme_bw(base_size = 10)

Those results are not significant, but can be explored to mine for potentially interesting genes. The number of genes “significant” in scenario of exploring non corrected p-values is low, but the results are still worth exploring, especially to find potential markers of smoking status. ### MA Plot for Former vs Never Smokers

former_vs_never_results$ggplotSignificant <- ifelse(
  abs(former_vs_never_results$logFC) >= 2 & 
    former_vs_never_results$P.Value < 0.05,
  "Yes",
  "No"
)

ggplot(former_vs_never_results, aes(x = AveExpr, y = logFC)) +
  geom_point(aes(color = ggplotSignificant), alpha = 0.5) +
  scale_color_manual(values = c("No" = "grey", "Yes" = "red")) +
  labs(
    title = "MA Plot: Former vs Never Smokers",
    x = "Average Expression",
    y = "Log Fold Change"
  ) +
  theme_bw(base_size = 10)

Those results are not significant, but can be explored to mine for potentially interesting genes. The number of genes “significant” in scenario of exploring non corrected p-values is very low, but the results are still worth exploring. ### MA Plot for Current vs Former Smokers

current_vs_former_results$ggplotSignificant <- ifelse(
  abs(current_vs_former_results$logFC) >= 2 & 
    current_vs_former_results$P.Value < 0.05,
  "Yes",
  "No"
)
ggplot(current_vs_former_results, aes(x = AveExpr, y = logFC)) +
  geom_point(aes(color = ggplotSignificant), alpha = 0.5) +
  scale_color_manual(values = c("No" = "grey", "Yes" = "red")) +
  labs(
    title = "MA Plot: Current vs Former Smokers",
    x = "Average Expression",
    y = "Log Fold Change"
  ) +
  theme_bw(base_size = 10)

Those results are not significant and the number of genes “significant” in scenario of exploring non corrected p-values is very low. ## 5.3. Gene Ontology (GO) Enrichment Analysis

Function for GO Enrichment

We define a reusable function to perform GO enrichment analysis on a set of differentially expressed genes.

run_go_enrichment <- function(de_results, p_value_col = "adj.P.Val", p_cutoff = 0.05, fc_cutoff = 1) {
  sig_genes <- de_results %>%
    filter(.data[[p_value_col]] < p_cutoff, abs(logFC) > fc_cutoff) %>%
    pull(GeneSymbol)
  
  if (length(sig_genes) == 0) {
    cat("No significant genes found for the given criteria.\n")
    return(NULL)
  }
  
  entrez_ids <- mapIds(org.Hs.eg.db, keys = sig_genes, column = "ENTREZID",
                       keytype = "SYMBOL", multiVals = "first")
  
  ego <- enrichGO(gene = na.omit(entrez_ids), OrgDb = org.Hs.eg.db,
                  keyType = 'ENTREZID', ont = "BP", pAdjustMethod = "BH",
                  pvalueCutoff = 0.05, qvalueCutoff = 0.02)
  
  return(ego)
}

GO Analysis for Tumor vs Normal

go_tumor_vs_normal <- run_go_enrichment(tumor_vs_normal_results,
                                        p_value_col = "adj.P.Val",
                                        p_cutoff = 0.05,
                                        fc_cutoff = 2)

if (!is.null(go_tumor_vs_normal) && nrow(as.data.frame(go_tumor_vs_normal)) > 0) {
  dotplot(go_tumor_vs_normal, showCategory = 15) + 
    labs(title = "GO Enrichment: Tumor vs Normal")
}

The GO enrichment analysis for tumor vs normal tissue shows significant terms related to immune response, cell proliferation, and metabolic processes, which is consistent with the known biology of tumors. ### GO Analysis for smokers with tumor vs normal

go_smoking_tumor_vs_normal <- run_go_enrichment(smoking_tumor_vs_normal_results,
                                                 p_value_col = "adj.P.Val",
                                                 p_cutoff = 0.05,
                                                 fc_cutoff = 8)
if (!is.null(go_smoking_tumor_vs_normal) && nrow(as.data.frame(go_smoking_tumor_vs_normal)) > 0){
  dotplot(go_smoking_tumor_vs_normal, showCategory = 15) + 
    labs(title = "GO Enrichment: Smoking Tumor vs Normal")
}

There is high enrichment in terms associated with immune response, cell migration and angiogenesis in the tumor tissue, which is consistent with the literature. ### GO Analysis for Current vs Never Smokers

Notice that for this and subsequent smoking comparisons, we filter on the raw p-value as very few genes are significant after multiple testing correction. These results should be treated as exploratory.

go_current_vs_never <- run_go_enrichment(current_vs_never_results,
                                           p_value_col = "P.Value",
                                           p_cutoff = 0.05,
                                           fc_cutoff = 2)

if (!is.null(go_current_vs_never) && nrow(as.data.frame(go_current_vs_never)) > 0) {
  dotplot(go_current_vs_never, showCategory = 15) + 
    labs(title = "GO Enrichment: Current vs Never")
}

The difference between current and never smokers is not significant, but we can still visualize the top terms. The exploration of those terms in not very informative, but it somewhat shows the effect of smoking on gene expression, especially for immune response, hepaticobiliary system and effect on white blood cells which is consistent with literature. ### GO Analysis for Former vs Never Smokers

go_former_vs_never <- run_go_enrichment(former_vs_never_results,
                                          p_value_col = "P.Value",
                                          p_cutoff = 0.05,
                                          fc_cutoff = 1)
if (!is.null(go_former_vs_never) && nrow(as.data.frame(go_former_vs_never)) > 0) {
  dotplot(go_former_vs_never, showCategory = 15) + 
    labs(title = "GO Enrichment: Former vs Never")
}

The difference berween current and former smokers is not significant, but we can still visualize the top terms. The exploration of those terms in not very informative, but it somewhat shows long term effects on longevity, even after quitting smoking. Not significant yet, but worth exploring in the future. ### GO Analysis for Current vs Former Smokers

go_current_vs_former <- run_go_enrichment(current_vs_former_results,
                                            p_value_col = "P.Value",
                                            p_cutoff = 0.05,
                                            fc_cutoff = 1)
if (!is.null(go_current_vs_former) && nrow(as.data.frame(go_current_vs_former)) > 0) {
  dotplot(go_current_vs_former, showCategory = 15) + 
    labs(title = "GO Enrichment: Current vs Former")
}

There is no significant GO enrichment for the interaction between smoking and tumor status, as the results are not significant before after multiple testing correction. However, we can still visualize the top terms.

5.4. Heatmaps of Enriched GO Terms

Function for Plotting Heatmaps

We define a function to generate heatmaps for the top enriched GO terms.

plot_go_heatmaps <- function(voom_obj, go_results, de_results, annotation_col, 
                             n_top = 5, p_value_col = "adj.P.Val", p_cutoff = 0.05, fc_cutoff = 1) {
  if (is.null(go_results) || nrow(as.data.frame(go_results)) == 0) {
    cat("No GO results to plot.\n")
    return(invisible(NULL))
  }
  
  top_go_terms <- head(as.data.frame(go_results), n_top)
  expr_matrix <- voom_obj$E
  
  all_sig_genes <- de_results %>%
    filter(.data[[p_value_col]] < p_cutoff, abs(logFC) > fc_cutoff) %>%
    pull(GeneSymbol)
  
  for (i in 1:nrow(top_go_terms)) {
    term_description <- top_go_terms$Description[i]
    gene_ids <- top_go_terms$geneID[i]
    genes_in_term_entrez <- str_split(gene_ids, "/")[[1]]
    
    symbols_in_term <- mapIds(org.Hs.eg.db, keys = genes_in_term_entrez,
                              column = "SYMBOL", keytype = "ENTREZID",
                              multiVals = "first")
    
    genes_to_plot <- intersect(na.omit(symbols_in_term), all_sig_genes)
    genes_to_plot <- intersect(genes_to_plot, rownames(expr_matrix))
    
    if (length(genes_to_plot) < 2) {
      cat("Skipping heatmap for '", term_description, "' (fewer than 2 significant genes).\n")
      next
    }
    
    mat_subset <- expr_matrix[genes_to_plot, ]
    
    pheatmap(mat_subset, main = paste("Heatmap for:", term_description),
             annotation_col = voom_obj$targets[, annotation_col, drop = FALSE],
             scale = "row", show_colnames = FALSE, fontsize_row = 8)
  }
}

Heatmaps for Tumor vs Normal Comparison

plot_go_heatmaps(voom_obj = v, 
                 go_results = go_tumor_vs_normal,
                 de_results = tumor_vs_normal_results,
                 annotation_col = "tissue", 
                 n_top = 5,
                 fc_cutoff = 2,
                 p_value_col = "adj.P.Val",
                 p_cutoff = 0.05)

There is increased expression of genes associated with angiongenesis and cell migration in the tumor tissue, which is consistent with the literature. The heatmap also shows that the genes are downregulated in the tumor tissue, which is consistent with the tumor suppressor genes being downregulated in cancer. The FOXF1 transcription factor is downregulated in the tumor tissue, which suggests that it is not working properly, hence plays major role in development of the tumor and its angiogenesis. ### Heatmaps for carcinogenic effect of smoking on tumor tissue

plot_go_heatmaps(voom_obj = v_int, 
                 go_results = go_smoking_tumor_vs_normal,
                 de_results = smoking_tumor_vs_normal_results,
                 annotation_col = "tissue", 
                 n_top = 5,
                 fc_cutoff = 6,
                 p_value_col = "adj.P.Val",
                 p_cutoff = 0.05)

Heatmaps are showing the influence of smoking on development of the tumor tissue. There is visible donregulation of tumor suppresor gene EMP2 in the tumor tissue of smokers, which is consistent with the literature. The heatmap also shows that the genes are upregulated in the tumor tissue of smokers, which is consistent with the carcinogenic effect of smoking. There is also higher expression of COL1A1, which can be associated with angiogenesis.